Abstract. Distance-based approaches in phylogenetics, such as Neighbor-Joining,
are a fast and popular way of building trees. These methods take pairs of
sequences and from them construct a value that, in expectation, is additive under
a stochastic model of site substitution. Most models assume a distribution of
rates across sites, often based on a gamma distribution. Provided the (shape)
parameter of this distribution is known, the method can correctly reconstruct
the tree. However, if the shape parameter is not known then we show that
topologically different trees, with different shape parameters and associated
positive branch lengths, can lead to exactly matching distributions on pairwise
site patterns between all pairs of taxa. Thus, one could not distinguish
between the two trees using pairs of sequences without some prior knowledge
of the shape parameter. More surprisingly, this can happen for any choice
of distinct shape parameters on the two trees, and thus the result is not
peculiar to a particular or contrived selection of the shape parameters. On a
positive note, we point out known conditions where identifiability can be
restored (namely, when the branch lengths are clocklike, or if methods such as
maximum likelihood are used).



Allan Wilson Centre for Molecular Ecology and Evolution, 

Biomathematics Research Centre, 

University of Canterbury, 

Christchurch,

New Zealand 

Email: m.steel@math.canterbury.ac.nz



1991 Mathematics Subject Classification. 05C05; 92D15. 

Key words and phrases. phylogenetic tree, distance-based methods, gamma distributed rates,
identifiability.


MIKE STEEL 



1. Introduction 

Stochastic models that describe the evolution of aligned DNA sequence sites 
are fundamental to most modern approaches to phylogenetic tree reconstruction 
[9]. Making these models more realistic usually requires introducing additional 
parameters. However, this raises the prospect that one might lose the ability to 
estimate a tree if one has to rely on the data to estimate all the parameters in 
the model. This could occur for various reasons - for example, it may be that 
two different trees could produce exactly the same probability distribution on site 
patterns for two appropriately selected settings of the other parameters in the 
model. Such a scenario would be a problem for any method of tree reconstruction 
(including maximum likelihood and Bayesian methods) as it would mean that in 
some cases, one could not distinguish between two trees even with infinitely long 
sequences. This loss of statistical 'identifiability' has been demonstrated for certain 
types of DNA substitution models, including rates-across-sites models [19] and, 
more recently, simple mixture models [13]. On the positive side, a number of
identifiability results have also been established for suitably constrained models 
(see, for example, [TJl [3 H H [H [18] ) . 

In this paper, we are interested in a phenomenon that is related to, but dif- 
ferent from the loss of statistical identifiability, since it is method-dependent. We 
will describe a situation where pairwise sequence comparison methods can fail to 
distinguish between trees, even though more sophisticated methods such as ML 
can. Thus, the models are statistically identifiable, as far as the tree parameter is 
concerned, but only if one uses the full matrix of aligned sequence information and 
not just pairwise sequence comparisons. Specifically, we consider tree reconstruction
when sequence sites evolve under a model in which site rates have a gamma
distribution, but where the (shape) parameter of the gamma distribution is not
known. In this case, if one uses all the aligned sequence data, or at least 3-way
sequence comparisons, then for DNA sequences one can recover the shape parameter
in a statistically consistent way, and thereby the underlying phylogenetic tree, by a
recent result of Allman and Rhodes [1]. However, if one uses just pairwise sequence
comparisons, we show that two different trees can produce exactly the same
pairwise sequence comparisons; moreover, this can happen for any choice of
distinct shape parameters for the two trees (by selecting the branch lengths on the two
trees appropriately).

The intuition behind this limitation on pairwise sequence comparisons has been 
nicely summarized by Felsenstein ([9], p. 175): the rate at which a site is evolving
affects all the taxa, but this constraint is not reflected by a method that is based on 
pairwise comparisons, and so, for example, "once one is looking at changes within 
rodents it will forget where changes were seen among primates." 

Before describing our results we mention some earlier papers that described related
but different phenomena. Baake [5] considered a model in which half the sites
are invariable and the remaining sites evolve under a general Markov model.
Although this model (and the tree) is generically identifiable using all the sequence
information (as recently shown in [3]), Baake showed that two trees can produce
identical pairwise sequence comparisons. The non-identifiability of divergence
times on a fixed tree under various rates-across-sites models has also been
recently investigated by Evans and Warnow. Finally, we note that our result
that distance-based methods can be misleading for tree inference complements some
earlier work [6], [11], which highlighted a different result in which distances can
perfectly 'fit' one phylogenetic tree when the full sequence data support a different
tree.



In sequence-based approaches to phylogenetics, the data usually consist of a
collection of $n$ sequences $s^1, s^2, \ldots, s^n$, each of length $N$, where each sequence
site takes values in some state space. We will suppose that there are $r$ states,
and denote them by Greek letters $\mu, \nu$ throughout - for example, for aligned DNA
sequence data $r = 4$ and the state space is the four DNA bases (A, C, G, T). Given
the aligned sequences, biologists seek to infer a phylogenetic tree $T$, whose leaves are
labeled by $X = \{1, \ldots, n\}$ and which describes the evolution of the sequences from some
unknown common ancestral sequence (leaf $i$ corresponds to the extant taxon from
which sequence $s^i$ has been obtained). For further background on phylogenetics,
the reader may consult [9], [15].

Given two sequences $s^i = (s^i_1, s^i_2, \ldots, s^i_N)$ and $s^j = (s^j_1, s^j_2, \ldots, s^j_N)$ let $\hat{J}_{ij}$ be
the $r \times r$ matrix whose $\mu\nu$-entry is the proportion of sites where sequence $s^i$ is in
state $\mu$ and sequence $s^j$ is in state $\nu$. The proportion of sites where sequences $s^i$
and $s^j$ differ, $\hat{\delta}_{ij}$ (the normalized sequence dissimilarity), is therefore the sum of the
off-diagonal entries of $\hat{J}_{ij}$; more formally,

    $\hat{\delta}_{ij} = \frac{1}{N}\,|\{k : s^i_k \neq s^j_k\}| = 1 - \mathrm{tr}(\hat{J}_{ij})$,

where tr refers to matrix trace (the sum of the diagonal entries).
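
The reduction from two aligned sequences to a pairwise matrix and a dissimilarity can be sketched in a few lines of Python. This is only an illustration: the function names `empirical_j` and `dissimilarity`, the state ordering "ACGT" and the two toy sequences are assumptions made for the example, not part of the paper.

```python
from collections import Counter

def empirical_j(seq_i, seq_j, states="ACGT"):
    """Return the r x r matrix of joint state proportions for two aligned sequences."""
    if len(seq_i) != len(seq_j):
        raise ValueError("sequences must be aligned to the same length")
    n_sites = len(seq_i)
    counts = Counter(zip(seq_i, seq_j))  # missing pairs count as 0
    return [[counts[(mu, nu)] / n_sites for nu in states] for mu in states]

def dissimilarity(j_hat):
    """Normalized dissimilarity: one minus the trace of the pairwise matrix."""
    return 1.0 - sum(j_hat[m][m] for m in range(len(j_hat)))

# Toy data (hypothetical): the sequences differ at 2 of 8 sites.
j_hat = empirical_j("ACGTACGT", "ACGTTCGA")
delta_hat = dissimilarity(j_hat)
```

Here the off-diagonal entries for the pairs (A, T) and (T, A) each hold one site out of eight, so the dissimilarity is the sum of those two entries.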

Given a collection of sequences $s^1, s^2, \ldots, s^n$, each of length $N$, one can easily
derive the collection of pairwise matrices $\{\hat{J}_{ij} : i,j \in \{1,\ldots,n\}\}$. This reduction
process, from aligned sequences to pairwise comparisons, is highly redundant
(for typical values of $n$) since it reduces the frequencies of $r^n$ site patterns to $\binom{n}{2}$
comparisons of site pattern frequencies. The further reduction to the $\hat{\delta}$ values
involves even more redundancy [16]. Despite this, it is well known that these
reduced matrices (and sometimes just the $\hat{\delta}$ values) provide a statistically consistent
way to estimate the underlying tree, under simple models of DNA site substitution.
This follows by combining two well-known facts.

Fact One: Under the assumption that the aligned sequence sites evolve i.i.d.,
the law of large numbers tells us that the $\hat{J}_{ij}$ matrices (and thereby the $\hat{\delta}_{ij}$ values)
converge in probability to their expected values as the sequence length $N$ becomes
large.

To explain this further we introduce two key definitions. For $i, j \in X$, let $J_{ij}$ be
the expected value of $\hat{J}_{ij}$ - thus, $J_{ij}$ is an $r \times r$ matrix whose $\mu\nu$-entry is

    $J_{ij}^{\mu\nu} = \mathbb{P}(s^i_k = \mu, \; s^j_k = \nu)$

for each pair of states $\mu, \nu$ and any given $k$; and let

    $d_{ij} = \mathbb{P}(s^i_k \neq s^j_k)$

for any given $k$. In words, $J_{ij}$ is the matrix whose entries describe the joint
probability that at any given site the sequences $s^i$ and $s^j$ are in specified states, while $d_{ij}$
is simply the probability that these states are different at a given site. By definition,
$d_{ij} = 1 - \mathrm{tr}(J_{ij})$.

With this notation, Fact One can be restated as the condition that, for all
$i,j \in X$:

    $\hat{J}_{ij} \xrightarrow{p} J_{ij}$ and $\hat{\delta}_{ij} \xrightarrow{p} d_{ij}$,

where $\xrightarrow{p}$ denotes convergence in probability as $N \to \infty$.

The second result required to show that the $\hat{J}$ values estimate the tree
consistently is that for many models the $J$ values can be transformed to obtain a function
on pairs of leaves that is additive. Recall that a function $l_{ij}$ on pairs of leaves of a
tree is said to be additive on a tree $T$ if one can assign a positive real number $l_e$ to
each edge $e$ of $T$ so that $l_{ij}$ is the sum of the numbers assigned to the edges on the
path connecting the two leaves on the tree. That is:

(1)    $l_{ij} = \sum_{e \in p(T;i,j)} l_e$,

where $p(T;i,j)$ denotes the edges on the path in $T$ connecting $i$ and $j$. This
additivity condition implies that the tree $T$ can be uniquely recovered from the $l_{ij}$
values (see e.g. [15]). With this in mind we have:

Fact Two: Under various models of sequence site evolution, a distance function $l$
on $X$ that is additive on the underlying tree can be computed from the $J$ matrices
(and sometimes just the $d$ values).
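
As a small illustration of why additivity pins down the tree, the sketch below builds the path sums of equation (1) on a four-leaf tree and then recovers the topology from the distances alone. It relies on the standard four-point condition (for an additive distance on a quartet, the pair-sum belonging to the true split is the smallest of the three), which is not spelled out in the text above; the function names and the branch lengths are hypothetical.

```python
def path_lengths(pendant, interior, cherries):
    """Pairwise leaf distances on a quartet tree, summing edge lengths along each path."""
    dist = {}
    leaves = sorted(pendant)
    for a in range(len(leaves)):
        for b in range(a + 1, len(leaves)):
            i, j = leaves[a], leaves[b]
            same_cherry = any({i, j} == set(c) for c in cherries)
            # Leaves in the same cherry do not cross the interior edge.
            dist[i, j] = pendant[i] + pendant[j] + (0.0 if same_cherry else interior)
    return dist

def quartet_topology(dist):
    """Return the split whose pair-sum is smallest (four-point condition)."""
    sums = {
        "12|34": dist[1, 2] + dist[3, 4],
        "13|24": dist[1, 3] + dist[2, 4],
        "14|23": dist[1, 4] + dist[2, 3],
    }
    return min(sums, key=sums.get)

# Hypothetical branch lengths on the tree 12|34.
dist = path_lengths({1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}, 0.05, [(1, 2), (3, 4)])
recovered = quartet_topology(dist)
```

With a strictly positive interior edge the two other pair-sums exceed the smallest one by twice the interior length, so the split is identified unambiguously.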

The two main models for which Fact Two is known to apply are (i) the general
Markov process, for which the transformation $J_{ij} \mapsto -\log(\det(J_{ij}))$ is additive, and
(ii) the general time-reversible (GTR) model with any known distribution of rates
across sites. In this latter case - which is the one of interest in this paper - one can
transform the $J$ matrices to obtain a distance function $l$ on $X$ that corresponds
to the expected number of substitutions ('evolutionary distance') between $i$ and $j$,
and which is therefore additive. For a GTR model with a distribution $\mathcal{D}$ of rates
across sites this transformation [20] is:

    $l_{ij} = -\mathrm{tr}(\Pi M^{-1}(\Pi^{-1} J_{ij}))$,

where $M = M_{\mathcal{D}}$ is the moment generating function of the distribution of rates across
sites, and where $\Pi = \mathrm{diag}(\pi)$ is the diagonal matrix whose leading diagonal is the
vector $\pi = [\pi_\mu]$ of the frequencies of the $r$ states. For the GTR model (or any
submodel) the matrix $J_{ij}$ is symmetric [20] and $J_{ii} = \Pi$ for each $i$.

Combining Fact One and Fact Two gives:

    $-\mathrm{tr}(\Pi M^{-1}(\Pi^{-1}\hat{J}_{ij})) \xrightarrow{p} l_{ij}$,

and so the $\hat{J}_{ij}$ values allow us to reconstruct the underlying tree from sufficiently
long sequences. Indeed, even if we don't know the stationary frequencies of the
states (the matrix $\Pi$) we can still recover the tree, since $\Pi$ is determined by (the
row sums of) $J_{ij}$, and so if we let $\hat{\Pi}$ denote the corresponding empirical state
frequencies (determined by the corresponding row sums of $\hat{J}_{ij}$) then we have:

    $-\mathrm{tr}(\hat{\Pi} M^{-1}(\hat{\Pi}^{-1}\hat{J}_{ij})) \xrightarrow{p} l_{ij}$.

Thus, if for each pair $i,j$ we derive an estimate $\hat{l}_{ij}$ of evolutionary distance ($l_{ij}$)
by either maximum likelihood estimation or by the 'corrected distance' formula:

(2)    $\hat{l}_{ij} = -\mathrm{tr}(\hat{\Pi} M^{-1}(\hat{\Pi}^{-1}\hat{J}_{ij}))$

then these estimated values will converge to the true $l_{ij}$ values as the sequence
length $N$ grows, allowing for statistically consistent reconstruction of the tree by
using fast distance-based tree reconstruction methods.

For some GTR models it is also possible to transform just the $d_{ij}$ to obtain $l_{ij}$ -
for example, under the simple symmetric 4-state model (the Jukes-Cantor model)
the transformation is:

(3)    $l_{ij} = -\frac{3}{4} M^{-1}\left(1 - \frac{4}{3} d_{ij}\right)$.

For models in which $l_{ij}$ can be expressed as a function of $d_{ij}$ one can use $\hat{\delta}$ in place
of $d$ to estimate $l_{ij}$ (for certain models, such as the Jukes-Cantor model, this leads
to the same $\hat{l}_{ij}$ estimates as a pairwise maximum likelihood estimate, but for more
complex models this need not be the case).
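
To see how strongly the corrected distance in (3) depends on the assumed rate distribution, the following sketch evaluates it for a gamma distribution with mean 1 and shape parameter $k$, whose moment generating function is $M(x) = (1 - x/k)^{-k}$ and hence $M^{-1}(y) = k(1 - y^{-1/k})$. The function names are hypothetical and the code covers only the Jukes-Cantor case.

```python
import math

def jc_gamma_distance(d, k):
    """Equation (3) with M^{-1}(y) = k * (1 - y**(-1/k)) for a mean-1 gamma of shape k."""
    y = 1.0 - 4.0 * d / 3.0
    if y <= 0:
        raise ValueError("dissimilarity at or beyond saturation (d >= 3/4)")
    return 0.75 * k * (y ** (-1.0 / k) - 1.0)

def jc_constant_rate_distance(d):
    """Classical Jukes-Cantor correction, the limit of the gamma case as k grows."""
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)

# The same dissimilarity maps to different distances under different shapes.
d = 0.3
by_shape = {k: jc_gamma_distance(d, k) for k in (0.5, 1.0, 2.0)}
```

Smaller shape parameters (more rate heterogeneity) give larger corrected distances for the same observed dissimilarity, which is why a wrongly chosen $k$ can distort the distances in a non-additive way.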

The snag in this otherwise appealing story is that it assumes that we know
the distribution $\mathcal{D}$ of rates across sites - what happens if $\mathcal{D}$ is unknown or has
parameters that require estimation? If no constraints are placed upon $\mathcal{D}$ then
identifiability of the tree can be completely lost [19]. It is therefore fortunate that in
molecular systematics $\mathcal{D}$ is typically described by a simple parametric distribution.
In particular, the gamma distribution has a long and popular history in models
that describe the variation of substitution rates across DNA sequence sites [21].
Today, a common default option is the 'GTR+$\Gamma$+I' model in which each site is
either invariant (with some probability), or it evolves according to a general time
reversible Markov process that proceeds at a rate selected randomly from a gamma
distribution. In this paper we will ignore the invariable sites, since our main result
(Theorem 3.1) will automatically imply a corresponding result when invariable sites
are present. Moreover, we may (without loss of generality) assume that the gamma
distribution is normalised so that its mean is equal to 1 and so there remains just
one parameter - the 'shape' parameter, $k$.

We will show that any two different shape parameters can provide exactly the 
same J matrices on a pair of topologically distinct trees (with appropriately assigned 
branch lengths). Consequently, using just pairwise comparisons (the J matrices) 
to infer phylogeny from the resulting data, without prior knowledge of the shape 
parameter is potentially problematic - either of the two trees could describe the data 
much better than the other if one were to select the shape parameter appropriate 
for that tree. Thus, a biologist exploring data by seeing the effect of varying k might 
note that for one value of $k$ his or her data fit a tree perfectly. The result described
here shows that it could be dangerous to stop at this point and report the tree,
as there may well be another value of $k$ for which the pairwise sequence data (or
distance data) fit a different tree perfectly. Using all the data (i.e. not reducing
to pairwise comparisons) will overcome this problem for a gamma distribution, as
established recently by Allman and Rhodes (who also pointed out errors in an
earlier approach from [14]).



3. Results 



In this paper we consider a particular type of reversible stationary Markov
process, called the equal input model. In this model, the rate of substitution does not
depend on the current state, and when a substitution event occurs, the new state
is selected according to the stationary distribution of states, which we encode by
the vector $\pi$. Thus the rate matrix $R$ is defined by the condition $R_{\mu\nu} = \pi_\nu$ for all
$\nu \neq \mu$. In the case of $r = 4$ states, this model has been called the 'Tajima-Nei equal
input model' or the 'Felsenstein 1981 model'; when, in addition, $\pi$ is uniform, it
is known as the 'Jukes-Cantor' model. For more mathematical background on
the equal input model, see, for example, [15]. Although the equal input model is a
special case of the GTR model, we have chosen it because it is simple enough to
allow tractable exact calculations, yet without being overly simplistic (for example,
it allows arbitrary stationary frequencies for the states).

Under the equal input model, and with constant-rate site evolution, we have:

(4)    $J_{ij}^{\mu\nu} = \begin{cases} \pi_\mu \pi_\nu (1 - \exp(-l_{ij}/\gamma)), & \text{if } \mu \neq \nu; \\ \pi_\mu (\pi_\mu + (1 - \pi_\mu)\exp(-l_{ij}/\gamma)), & \text{if } \mu = \nu, \end{cases}$

where $l_{ij}$ is the expected number of substitutions on the path connecting $i$ and $j$ in
$T$ (an additive distance) and $\gamma = 1 - \sum_{\mu} \pi_\mu^2$ (this number is the expected normalised
sequence dissimilarity for saturated sequences - for example, in the Jukes-Cantor
model, it takes the value $1 - 4 \cdot (\frac{1}{4})^2 = \frac{3}{4}$). More briefly we can write:

(5)    $J_{ij}^{\mu\nu} = a_{\mu\nu} + b_{\mu\nu} \exp(-l_{ij}/\gamma)$,

where $a_{\mu\nu}, b_{\mu\nu}$ are constants that depend on the pair $\mu, \nu$ and the vector $\pi$.

We now impose an associated distribution $\mathcal{D}$ of rates across sites on this equal
input model, so that each site evolves according to the same equal input
model, but with a rate selected randomly according to $\mathcal{D}$. In this case (5) becomes:

(6)    $J_{ij}^{\mu\nu} = a_{\mu\nu} + b_{\mu\nu} M_{\mathcal{D}}(-l_{ij}/\gamma)$,

where $M_{\mathcal{D}}(x)$ is the moment generating function for $\mathcal{D}$. When $\mathcal{D}$ is a gamma
distribution of rates across sites with shape parameter $k$ and mean 1 we have:

    $M_{\mathcal{D}}(x) = \left(1 - \frac{x}{k}\right)^{-k}$,

and so Eqn. (6) becomes:

(7)    $J_{ij}^{\mu\nu} = a_{\mu\nu} + b_{\mu\nu} \left(1 + \frac{l_{ij}}{k\gamma}\right)^{-k}$.
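
The pairwise matrix under the equal input model with gamma-distributed rates can be evaluated directly, as in the sketch below. It combines the case formula for the constant-rate model with the gamma moment generating function; the function name `equal_input_gamma_j` and the example frequencies are assumptions made for illustration.

```python
def equal_input_gamma_j(pi, l_ij, k):
    """Joint distribution of the states at two leaves separated by path length l_ij."""
    gamma = 1.0 - sum(p * p for p in pi)
    # Gamma moment generating function evaluated at -l_ij / gamma.
    decay = (1.0 + l_ij / (k * gamma)) ** (-k)
    r = len(pi)
    j = [[0.0] * r for _ in range(r)]
    for mu in range(r):
        for nu in range(r):
            if mu == nu:
                j[mu][nu] = pi[mu] * (pi[mu] + (1.0 - pi[mu]) * decay)
            else:
                j[mu][nu] = pi[mu] * pi[nu] * (1.0 - decay)
    return j

# Hypothetical base frequencies, path length and shape parameter.
pi = [0.1, 0.2, 0.3, 0.4]
j = equal_input_gamma_j(pi, 0.5, 2.0)
d = 1.0 - sum(j[m][m] for m in range(4))
```

Each row of the matrix sums to the corresponding stationary frequency, and the dissimilarity equals $\gamma$ times one minus the decay term, so the whole matrix depends on the path length and the shape parameter only through that single decay term.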



Now, suppose we have two topologically distinct binary phylogenetic $X$-trees $T$
and $T'$, where $T$ has branch lengths $l$ and gamma distribution of rates across sites
(with mean 1) with shape parameter $k$, while $T'$ has branch lengths $l'$ and gamma
distribution of rates across sites (with mean 1) with shape parameter $k'$, where
$k' \neq k$. We can now state the main result of this paper.

Theorem 3.1. Consider a fixed equal input model on $r \geq 2$ states. Then for any
$k, k' > 0$ with $k \neq k'$ and for any binary phylogenetic $X$-tree $T$ with four or more
leaves there exist a topologically distinct binary phylogenetic $X$-tree $T'$, and strictly
positive branch lengths $l$ for $T$ and $l'$ for $T'$ respectively, so that the matrices of
joint pairwise distributions $J_{ij}$ and $J'_{ij}$ agree for all $i,j \in X$.



Remarks: The significance of this result for phylogenetic reconstruction is that
it shows that even if one uses pairwise sequence comparisons, the choice of the
correct shape parameter for the gamma distribution is essential: if we selected
shape parameter $k$, the corrected distances (obtained by ML estimation or by (2))
would fit $T$ perfectly as the sequence lengths become large; while if we selected
shape parameter $k'$, the corrected distances would fit $T'$ perfectly for sufficiently
long sequences. Notice that the pairs $(T, k)$ and $(T', k')$ fit the data produced by
either tree (with its associated shape parameter) equally well (i.e. perfectly in the
limit as the sequence lengths become large). Moreover, our result assumes that
the base frequency vector ($\pi$) is known and the same for both trees. Notice also
that Theorem 3.1 automatically implies that any distance correction method that
transforms the sequence dissimilarities (the $d$ values) will be unable to distinguish
between $T$ and $T'$ if the shape parameter is unknown.

Proof of Theorem 3.1

For a given assignment of branch lengths $l$ and gamma shape parameter $k$ for $T$
let $J_{ij}$ denote the induced pairwise distribution matrix, defined by Eqn. (7), for
each $i,j$. Similarly, for a given assignment of branch lengths $l'$ and gamma shape
parameter $k'$ for $T'$ let $J'_{ij}$ denote the induced pairwise distribution matrix for each
$i,j$. By symmetry, we may assume (without loss of generality) that $k > k'$. Let

(8)    $\tau_{ij} := 1 + \frac{l_{ij}}{k\gamma}$,

(9)    $\tau'_{ij} := 1 + \frac{l'_{ij}}{k'\gamma}$,

and let

    $p := \frac{k}{k'} > 1$.

From Eqn. (7) and the notation of (8) and (9), we have the following fundamental
identity:

(10)    $J_{ij} = J'_{ij}$ if and only if $\tau'_{ij} = (\tau_{ij})^p$.
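
The identity (10) can be turned into an explicit map from a path length under shape $k$ to the path length under shape $k'$ that yields the same pairwise matrix. The sketch below is a direct transcription of that relation under the equal input model with a shared frequency vector; the function names and the numerical values are hypothetical.

```python
def matching_path_length(l_ij, k, k_prime, gamma):
    """Path length under shape k' giving the same pairwise matrix as l_ij under shape k."""
    tau = 1.0 + l_ij / (k * gamma)
    tau_prime = tau ** (k / k_prime)     # the identity tau' = tau**p with p = k / k'
    return k_prime * gamma * (tau_prime - 1.0)

def decay(l_ij, k, gamma):
    """The term (1 + l_ij / (k * gamma)) ** (-k) on which the pairwise matrix depends."""
    return (1.0 + l_ij / (k * gamma)) ** (-k)

k, k_prime, gamma = 2.0, 0.5, 0.75
l_prime = matching_path_length(0.4, k, k_prime, gamma)
```

Because this map is nonlinear whenever $k \neq k'$, applying it to an additive distance does not in general give a distance that is additive on the same tree, which is the freedom the proof exploits.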



We will first prove Theorem 3.1 in the case where $|X| = 4$, and then extend the
proof to the general case.

The case $|X| = 4$: Consider the tree $T$ with branch lengths given in Fig. 1(a),
and the tree $T'$ with branch lengths given in Fig. 1(b). By (1) we have, for example,
$l_{12} = l_1 + l_2$, and $l_{13} = l_1 + l_3 + l_5$. Let $l'_{ij}$ be the corresponding $l'$ values induced
by $T'$.

Figure 1. (a) Tree $T$ with branch lengths $l$; (b) Tree $T'$ with branch lengths $l'$.

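Notice that if we set

(11)    $x_i := \frac{1}{2} + \frac{l_i}{k\gamma}$ for $i = 1, \ldots, 4$

and set

(12)    $\epsilon := \frac{l_5}{k\gamma}$,

then for each distinct pair $i,j$ we have:

(13)    $\tau_{ij} = \begin{cases} x_i + x_j, & \text{if } \{i,j\} = \{1,2\} \text{ or } \{3,4\}; \\ x_i + x_j + \epsilon, & \text{otherwise,} \end{cases}$

thus $\tau_{ij}$ is additive on $T$ (similarly, $\tau'_{ij}$ defined by (9) is additive on $T'$).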



We will describe an assignment of positive branch lengths for T, and then an 
assignment of branch lengths for T' . Firstly, however we state a convexity lemma; 
for completeness a proof is provided in the Appendix. 

Lemma 3.2. Suppose $f$ is twice-differentiable, and that $f''$ is strictly positive on the
positive reals. If $u' < u < v < v'$ and $u + v = u' + v'$ then: $f(u') + f(v') > f(u) + f(v)$.

We will apply Lemma 3.2 twice during the proof, using the function $f(x) = x^p$,
which satisfies the hypotheses of this lemma, since $f''(x) = p(p-1)x^{p-2}$ and $p > 1$.

Returning to the assignment of branch lengths for $T$, let $L > \frac{1}{2}$ and $t \in [0, 1]$,
and select $l_1, \ldots, l_5$ so that $(x_1, x_2, x_3, x_4, \epsilon)$ defined by (11) and (12) satisfy the
following system:

(14)    $\min\{x_1, x_2, x_3, x_4\} = L$;

(15)    $x_3 > x_4 + t$;

(16)    $x_2 < x_4$;

(17)    $x_1 + x_3 < x_2 + x_4$;

(18)    $x_1 + x_4 < x_2 + x_3$;

(19)    $|x_i - x_j| \leq 1$ for all $i, j$;

and

(20)    $(x_1 + x_4 + \epsilon)^p + (x_2 + x_3 + \epsilon)^p = (x_1 + x_2)^p + (x_3 + x_4)^p$.

We pause to observe that this system (for the five $l_i$ values) is feasible for
arbitrarily large values of $L$. For example, we can take $x_1 = L$, $x_2 = L + \frac{1}{2}$, $x_3 =
L + 1$, $x_4 = L + \frac{3}{4}$ and $t = \frac{1}{8}$, to satisfy (14)-(19), and then for $i = 1, \ldots, 4$ let
$l_i = k\gamma(x_i - \frac{1}{2})$, which is strictly positive since $x_i \geq L > \frac{1}{2}$; then for $l_5$ there exists
a positive value of $\epsilon$ satisfying (20). To see this last claim regarding $\epsilon$, let

    $u = x_1 + x_4$;  $v = x_2 + x_3$,

and

    $u' = x_1 + x_2$;  $v' = x_3 + x_4$.

Notice that the inequalities (16) and (18) imply that $u' < u < v < v'$ and, since
$u + v = u' + v'$, Lemma 3.2 applied to $f(x) = x^p$ gives $f(u) + f(v) < f(u') + f(v')$.
Since $f$ is continuous, strictly increasing and unbounded, this implies that there is a finite and strictly positive
value of $\epsilon$ (and thereby of $l_5$ by (12)) for which $f(u + \epsilon) + f(v + \epsilon) = f(u') + f(v')$,
as claimed.

Next we show that the branch lengths we have assigned for $T$ allow us to assign
positive branch lengths to $T'$ so that $J_{ij} = J'_{ij}$ holds for all $i,j$. Define $\lambda_{ij} := f(\tau_{ij})$
where $f(x) = x^p$. We will show that there exists an assignment of positive branch
lengths $l'$ to $T'$ for which the associated vector $\tau'$ defined by (9) satisfies:

(21)    $\lambda_{ij} = \tau'_{ij}$.

In view of (10) this will establish the theorem in the case $|X| = 4$. Let

(22)    $S_{12|34} := \lambda_{12} + \lambda_{34}$, $S_{13|24} := \lambda_{13} + \lambda_{24}$, and $S_{14|23} := \lambda_{14} + \lambda_{23}$.

If we let

    $u = x_1 + x_3 + \epsilon$;  $v = x_2 + x_4 + \epsilon$,

and

    $u' = x_1 + x_4 + \epsilon$;  $v' = x_2 + x_3 + \epsilon$,

then (15) and (17) imply that $u' < u < v < v'$ and, since $u + v = u' + v'$, Lemma 3.2
gives $f(u) + f(v) < f(u') + f(v')$. In view of (13) and (22) this implies that:

(23)    $S_{13|24} < S_{14|23}$.

Moreover, Eqn. (20) implies that

(24)    $S_{12|34} = S_{14|23}$.

Equations (23) and (24) imply that $\lambda_{ij}$ can be realized as a sum of real-valued
branch lengths on $T'$ by assigning a positive interior branch length (call it $\epsilon'$), and
real-valued (possibly negative) pendant branch lengths (by [12]). We will first show
that these four pendant branch lengths are not only positive, but also strictly greater
than $\frac{1}{2}$ provided $L$ is chosen sufficiently large. For $i \in \{1, 2, 3, 4\}$, if we let $\lambda_i$ denote
the branch length of the edge incident with leaf $i$, then $\lambda_i = \frac{1}{2}(\lambda_{ij} + \lambda_{ik} - \lambda_{jk})$ for
any choice of $j, k$ for which $|\{i, j, k\}| = 3$ and $j$ is the leaf that forms a cherry with $i$ in $T'$.
Now, from (14) and (19), we have

    $\lambda_{ij} + \lambda_{ik} - \lambda_{jk} > f(2L) + f(2L) - f(2L + 2 + \epsilon)$

and so we can select a value of $L$ that is sufficiently large to ensure that $\lambda_i > \frac{1}{2}$ for
$i = 1, \ldots, 4$. We can now assign the positive branch lengths to $T'$ as follows. Let
$l'_5 = k'\gamma\epsilon'$ and for $i \in \{1, \ldots, 4\}$ let

    $l'_i = k'\gamma(\lambda_i - \frac{1}{2}) > 0$.

With these branch lengths we have (from (21)):

    $\tau'_{ij} = (\tau_{ij})^p$

for all $i, j$. By Eqn. (10), this establishes the theorem in the case where $|X| = 4$.
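
The quartet construction can be checked numerically. The sketch below is a hedged illustration rather than the author's computation: it fixes one hypothetical feasible choice of $x_1, \ldots, x_4$, solves (20) for $\epsilon$ by bisection, builds branch lengths for $T = 12|34$ and $T' = 13|24$ as in the proof, and compares the term $(1 + l_{ij}/(k\gamma))^{-k}$ that determines each pairwise matrix. All function names, the values $k = 2$, $k' = 1$, $\gamma = 3/4$ and $L = 1$ are assumptions.

```python
from itertools import combinations

def solve_eps(x, p):
    """Bisection for the eps > 0 that makes equation (20) hold."""
    target = (x[1] + x[2]) ** p + (x[3] + x[4]) ** p
    def gap(e):
        return (x[1] + x[4] + e) ** p + (x[2] + x[3] + e) ** p - target
    lo, hi = 0.0, 1.0
    while gap(hi) < 0:              # grow the bracket until the sign changes
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def quartet_pair(k, k_prime, gamma=0.75, L=1.0):
    """(pendant lengths, interior length) for T = 12|34 and for T' = 13|24."""
    p = k / k_prime                                    # assumes k > k_prime
    x = {1: L, 2: L + 0.5, 3: L + 1.0, 4: L + 0.75}    # illustrative feasible choice
    eps = solve_eps(x, p)
    lam = {}
    for i, j in combinations((1, 2, 3, 4), 2):
        tau = x[i] + x[j] + (0.0 if (i, j) in ((1, 2), (3, 4)) else eps)
        lam[i, j] = lam[j, i] = tau ** p
    eps_prime = 0.5 * (lam[1, 4] + lam[2, 3] - lam[1, 3] - lam[2, 4])
    partner = {1: 3, 3: 1, 2: 4, 4: 2}                 # cherries of T'
    lam_leaf = {}
    for i in (1, 2, 3, 4):
        j = partner[i]
        m = min({1, 2, 3, 4} - {i, j})                 # any leaf outside the cherry
        lam_leaf[i] = 0.5 * (lam[i, j] + lam[i, m] - lam[j, m])
    tree = ({i: k * gamma * (x[i] - 0.5) for i in x}, k * gamma * eps)
    tree_prime = ({i: k_prime * gamma * (lam_leaf[i] - 0.5) for i in x},
                  k_prime * gamma * eps_prime)
    return tree, tree_prime

def decay(tree, cherries, i, j, k, gamma):
    """The term (1 + l_ij / (k * gamma)) ** (-k) that determines the pairwise matrix."""
    pendant, interior = tree
    l_ij = pendant[i] + pendant[j] + (0.0 if (i, j) in cherries else interior)
    return (1.0 + l_ij / (k * gamma)) ** (-k)

k, k_prime, gamma = 2.0, 1.0, 0.75
T, T_prime = quartet_pair(k, k_prime, gamma)
worst = max(
    abs(decay(T, ((1, 2), (3, 4)), i, j, k, gamma)
        - decay(T_prime, ((1, 3), (2, 4)), i, j, k_prime, gamma))
    for i, j in combinations((1, 2, 3, 4), 2)
)
```

For these inputs every branch length on both trees comes out strictly positive, and `worst` is at the level of floating-point rounding, so the two topologically distinct trees share all six pairwise decay terms. Other choices of `L` or of the shape parameters may need a larger `L` before the pendant lengths on $T'$ become positive.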

The case $|X| > 4$: To extend the proof to larger trees we require a further
lemma, which is based on the following definition. Given a rooted phylogenetic
tree $t$ with root vertex $\rho$ (which we assume is a vertex of degree at least two) and
associated branch lengths $l$, we say that the branch lengths on $t$ are clock-like if the
sum of the branch lengths from $\rho$ to any leaf takes the same value for each leaf,
which we will denote by $h(t, l)$ (the 'height' of $\rho$). We will use the following lemma,
for which a proof is provided in the Appendix.

Lemma 3.3. Let $t$ be a rooted phylogenetic tree with at least two leaves. Suppose
that the branch lengths for $t$ are clock-like, and that we have a gamma distribution
of rates across sites (with mean 1) and with shape parameter $k$. For any other shape
parameter $k'$ there exists a unique associated vector of branch lengths $l'$ for $t$ that
are clock-like and such that the induced $J'$ matrices satisfy the condition:

(25)    $J'_{ij} = J_{ij}$ for all leaves $i,j$ of $t$.

Moreover, for this vector $l'$, we have:

(26)    $h(t, l') = \frac{1}{2} k'\gamma \left(-1 + \left(1 + \frac{2h(t,l)}{k\gamma}\right)^{k/k'}\right)$.
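
Equation (26) can be read as a map on vertex heights: in a clock-like tree the distance between two leaves is twice the height of their most recent common ancestor, so applying the map to every internal vertex height gives the clock-like branch lengths under the other shape parameter. The sketch below illustrates this with hypothetical heights; the function names are assumptions.

```python
def matching_height(h, k, k_prime, gamma):
    """Height under shape k' that reproduces the pairwise matrices, following (26)."""
    return 0.5 * k_prime * gamma * ((1.0 + 2.0 * h / (k * gamma)) ** (k / k_prime) - 1.0)

def decay(path_len, k, gamma):
    """The term (1 + l / (k * gamma)) ** (-k) on which each pairwise matrix depends."""
    return (1.0 + path_len / (k * gamma)) ** (-k)

k, k_prime, gamma = 2.0, 0.5, 0.75
node_heights = [0.05, 0.2, 0.6]      # hypothetical heights of nested internal vertices
new_heights = [matching_height(h, k, k_prime, gamma) for h in node_heights]
```

Since the map is strictly increasing and sends 0 to 0, nested vertices stay nested and the transformed tree again has positive, clock-like branch lengths; the map also tends to zero with the height, which is the property used when choosing the subtree heights in the general case.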

Returning to the proof of the theorem, let $T$ be any binary phylogenetic tree with
more than four leaves, and select any interior edge $e$ of $T$. Consider the four rooted
subtrees $t_1, t_2, t_3, t_4$ of $T$ that result from deleting this edge and its two endpoints,
as shown in Fig. 2(a). Let $T'$ be the tree obtained from $T$ by interchanging the
subtrees $t_2$ and $t_3$, as shown in Fig. 2(b). Let $l$ and $l'$ be strictly positive branch
lengths for the two quartet trees of Fig. 1 for which we have $J_{ij} = J'_{ij}$ for all
$i,j \in \{1, 2, 3, 4\}$ (by the case of the theorem established already for $|X| = 4$).

We now assign branch lengths to $T$ and $T'$. For tree $T$ assign length $l_5$ to edge $e$
indicated in Fig. 2(a), and for the tree $T'$ assign length $l'_5$ to edge $e$ of $T'$ indicated
in Fig. 2(b). If $t_i$ consists of just a single leaf, we assign length $l_i$ in $T$ and $l'_i$ in $T'$.
Thus it remains to specify how we assign branch lengths to the subtrees $t_i$ when
these trees contain more than one leaf, and to the edge $e_i$ that connects $t_i$ to $e$.
If we regard $t_i$ as a rooted binary phylogenetic tree (for which the root $\rho_i$ is the
vertex adjacent to an endpoint of $e$, as shown in Fig. 2), we assign branch lengths
to $t_i$ that are clock-like and for which $h(t_i, l) = h_i$, where $h_i$ is any strictly positive
number that is less than $l_i$ and satisfies the condition:

(27)    $\frac{1}{2} k'\gamma \left(-1 + \left(1 + \frac{2h_i}{k\gamma}\right)^{k/k'}\right) < l'_i$.

Then assign edge $e_i$ length $l_i - h_i > 0$. Note that we can select $h_i$ to satisfy (27)
since the left-hand side of (27) converges to zero as $h_i \to 0$. For tree $T'$ assign $t'_i$
branch lengths that are clock-like and satisfy (25) of Lemma 3.3 (for $t = t_i$, $t' = t'_i$),
and assign edge $e_i$ length $l'_i - h(t'_i, l')$, which is strictly positive by (26) and (27). We claim
that $J_{ij} = J'_{ij}$ for all $i,j$. We have just shown that this holds whenever $i, j$ are
leaves in the same subtree ($t_1$, $t_2$, $t_3$ or $t_4$); thus it remains to check the claim when
$i$ and $j$ lie in different subtrees, say $t_r, t_s$. In this case the condition that $(T, l)$
and $(T', l')$ satisfy the theorem in the case $|X| = 4$ and the fact that the distance
between $i$ and $j$ in $T$ is $l_{rs}$ and in $T'$ is $l'_{rs}$ (according to the way the branch lengths
have been assigned) establishes the claim. This completes the proof. □



4. Concluding comments 



Our result shows that rate variation across sites can indeed provide an "inherent
limitation that is worrisome" [5] for methods that rely solely on pairwise sequence
comparisons. Despite the limitation of distance-based phylogenetic reconstruction
imposed by Theorem 3.1, there is one situation where distances suffice to recover a
tree under a gamma rate distribution across sites, even when the shape parameter
is unknown. This is when the underlying branch lengths on the tree are clock-like
(i.e. obey a 'molecular clock'). This follows from the monotone relationship
between $d$ and $l$ described in (10), which implies that the $d$ values (corrected or
not) will be ultrametric and additive on the underlying tree.

Also, our result does not imply that tree reconstruction is hopeless without prior
or independent knowledge of the shape parameter, since Allman and Rhodes [1]
have established that identifiability holds for this model (generically for all $r \geq 2$,
and exactly when $r = 4$, which is the case that applies for DNA sequence data) and
so methods such as maximum likelihood will be statistically consistent. Moreover,
their result shows that just 3-way sequence comparisons are sufficient to identify the
shape parameter. This suggests that it may be possible to develop statistically
consistent but fast modifications of distance-based tree reconstruction methods (such
as neighbor joining) that allow some triple-wise calculations.

Finally, it would also be interesting to check whether Theorem 3.1 remains true
if one replaces the equal input model by the GTR model with any fixed (and given)
rate matrix $R$. This seems quite likely, though the calculations appear to be more
involved when the rate matrix has many different eigenvalues. The question of
whether $T'$ can have an arbitrary topology different to $T$ in Theorem 3.1 (i.e. not
just a nearest-neighbor interchange of $T$) could also be of interest.
6. Appendix



Proof of Lemmas 3.2 and 3.3

Proof of Lemma 3.2. Let $t := u - u' = v' - v > 0$. By a Taylor expansion, we have:

    $f(u') = f(u - t) = f(u) - t f'(u) + \frac{1}{2} t^2 f''(\theta)$,

where $\theta \in [u', u]$, and:

    $f(v') = f(v + t) = f(v) + t f'(v) + \frac{1}{2} t^2 f''(\theta')$,

where $\theta' \in [v, v']$. Thus:

    $f(u') + f(v') = f(u) + f(v) + t(f'(v) - f'(u)) + \frac{1}{2} t^2 (f''(\theta) + f''(\theta'))$.

Now $f'(v) - f'(u) > 0$ since $f'$ is increasing (by the positivity of $f''$), and so, since
$t > 0$:

    $f(u') + f(v') > f(u) + f(v) + \frac{1}{2} t^2 (f''(\theta) + f''(\theta')) > f(u) + f(v)$,

where the last inequality follows from the positivity of $f''$. □

Proof of Lemma 3.3. Since the branch lengths of $t$ are clock-like, it follows that
$l$ and hence $\tau$ is an ultrametric, i.e. for any three leaves $i, j, k$ of $t$ we have:

    $\tau_{ij} \leq \max\{\tau_{ik}, \tau_{jk}\}$.

It follows that $\tau^p$ (where $p = k/k'$) satisfies precisely the same ultrametric
conditions as $\tau$ and so we can assign (unique) positive branch lengths to $t$ that realize $\tau^p$
and which are clock-like. From (10), these branch lengths satisfy (25) of Lemma 3.3.
Moreover, since $t$ has at least two leaves, we can select two leaves (say $u, v$) so that
the path connecting $u$ and $v$ contains $\rho$. Then $l_{uv} = 2h(t, l)$, and $l'_{uv} = 2h(t, l')$.
Now $\tau'_{uv} = (\tau_{uv})^p$ and so, by (8) and (9), we have:

    $\left(1 + \frac{2h(t,l)}{k\gamma}\right)^p = 1 + \frac{2h(t,l')}{k'\gamma}$,

from which equality (26) of Lemma 3.3 now follows. □



